Frontiers in Artificial Intelligence
Frontiers Media SA
Preprints posted in the last 30 days, ranked by how well they match Frontiers in Artificial Intelligence's content profile, based on 18 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit.
Piorkowska, N. J.; Olejnik, A.; Ostromecki, A.; Kuliczkowski, W.; Mysiak, A.; Bil-Lula, I.
Background: Machine-learning models based on circulating biomarkers are increasingly used in cardiovascular research; however, model performance alone provides limited insight into how the predictive signal is distributed across features. We aimed to characterize the biomarker signal architecture of a machine-learning model distinguishing ST-elevation myocardial infarction (STEMI) from non-ST-elevation myocardial infarction (NSTEMI), with a focus on signal concentration, redundancy, and conditional complementarity. Methods: We conducted a structured secondary analysis of a previously established, leakage-controlled machine-learning framework (n = 152 patients). The BIOMARKERS feature-set variant (10 biomarkers) was evaluated using outer-fold cross-validation. Model structure was interrogated using (i) leave-one-biomarker-out analysis, (ii) pairwise leave-two-out analysis with pair-excess estimation, (iii) cumulative ablation of top-ranked biomarkers, and (iv) forward reconstruction of minimal biomarker panels. Uncertainty was assessed using bootstrap resampling across folds. Results: The full biomarker model achieved a mean ROC-AUC approaching 0.94. The predictive signal was highly non-uniform, with MMP-2 showing the largest single-feature contribution (mean ΔAUC ≈ 0.16). Pairwise analysis identified conditional complementarity between selected non-lipid biomarkers, particularly MMP-2 and EMMPRIN (pair ΔAUC ≈ 0.26; positive excess over single-feature effects), whereas lipid-related markers formed a highly correlated and largely redundant sub-cluster. Cumulative ablation demonstrated rapid performance collapse following removal of top-ranked biomarkers, consistent with structural signal concentration. Forward panel analysis showed that a compact subset of biomarkers (three features) achieved performance within ~0.01 ROC-AUC of the full model, indicating the presence of a minimal high-yield panel.
Bootstrap confidence intervals suggested that small performance differences should be interpreted with caution. Conclusions: Predictive performance in this biomarker-based model arises from a structured and unevenly distributed signal architecture, characterized by a dominant core biomarker, conditionally complementary contributors, and a redundant lipid cluster. These findings highlight the importance of evaluating model structure, not only aggregate performance, and suggest that biomarker-based machine-learning systems may benefit from architecture-aware interpretation and simplification strategies.
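The pair-excess idea in the abstract above can be illustrated with a minimal sketch. The abstract does not give the formula, so the definition used here (joint AUC drop minus the sum of the two single-feature drops) is an assumption, and every AUC value in the example is hypothetical rather than taken from the study.

```python
def delta_auc(full_auc, ablated_auc):
    """AUC drop after removing features (positive means the features mattered)."""
    return full_auc - ablated_auc


def pair_excess(delta_pair, delta_a, delta_b):
    """Assumed definition: joint drop minus the sum of the single-feature drops.

    A positive value suggests the two features are conditionally complementary;
    a negative value suggests redundancy.
    """
    return delta_pair - (delta_a + delta_b)


# Hypothetical AUCs for illustration only; not values from the study.
full = 0.90
without_a = 0.80   # leave-one-out: feature A removed
without_b = 0.88   # leave-one-out: feature B removed
without_ab = 0.70  # leave-two-out: A and B removed together

d_a = delta_auc(full, without_a)
d_b = delta_auc(full, without_b)
d_ab = delta_auc(full, without_ab)
excess = pair_excess(d_ab, d_a, d_b)
print(round(d_a, 2), round(d_b, 2), round(d_ab, 2), round(excess, 2))
```

In this toy case the joint drop exceeds the sum of the single drops, which is the pattern the abstract describes as positive excess.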
Serrano, A. E.
Machine learning (ML) has emerged as a transformative technology across biomedical and life science sectors, with applications spanning drug discovery, medical imaging, genomics, and clinical decision support (Goecks et al., 2020; Patel et al., 2020). Despite exponential growth in ML-related publications, from fewer than 100 articles in 2003 to nearly 25,000 by 2021 (NCBI, 2022), adoption among industry professionals remains uneven and sector-dependent. Understanding what drives or inhibits this adoption is critical for organisations seeking to leverage ML capabilities in research and clinical practice. Technology adoption in organisational contexts has been extensively studied through the Technology Acceptance Model (TAM), originally proposed by Davis (1989) and subsequently extended to incorporate external variables influencing perceived usefulness (PU) and perceived ease of use (PEU) (Venkatesh & Davis, 1996). While TAM has been applied across multiple industries, its application within biomedical and life science contexts remains limited, and the industry-specific factors that shape ML acceptance in this sector have not been systematically examined. Two external variables are particularly relevant to life science professionals. First, the bibliometric journal impact factor (JIF) functions as a cognitive signal of scientific credibility, a sector where evidence-based decision-making is culturally embedded, and publication quality serves as a proxy for technological legitimacy (Garfield, 1996). Second, technology hype, operationalised through the Gartner Hype Cycle framework, represents a social influence variable that shapes organisational expectations and investment decisions around emerging technologies (Gartner Inc., 2018). Whether these variables influence ML acceptance among life science professionals, alongside individual knowledge and experience, has not been empirically tested. 
This study addresses that gap by investigating ML technology acceptance among 213 biomedical and life science professionals across EMEA, LATAM, and North America, using a cross-sectional quantitative survey and PLS-SEM analysis. The TAM model is extended with three external variables, JIF, technology hype, and prior knowledge and experience, to test their influence on PU and PEU in this specific professional context. Additionally, the study examines demographic and regional differences in ML acceptance, with particular attention to variation between academic researchers and healthcare professionals. The findings contribute a validated, sector-specific extension of TAM for life sciences, provide actionable insights for organisations seeking to accelerate ML implementation, and establish a framework for future subsector-specific research.
Dahlberg, A. C. H.; Tapiola, O.; Luisto, R.; Puranen, T.; Sanmark, E.; Vartiainen, V.
Background: Embedding models are an integral part of generative AI architectures, transforming text into embedding vectors that represent semantic content in numerical form. Despite their central role, their performance in clinical settings remains underexplored. We evaluate embedding models across two tasks: semantic difference detection in clinical texts, and data retrieval from patient records. Methods: Eight models were applied to synthetic discharge summaries in English, Finnish, and Swedish. Semantic sensitivity was assessed by introducing controlled perturbations (deletion, modification, and paraphrasing) at three levels of severity; cosine similarity, and L1 and Euclidean distances were computed between the vectors of the original and perturbed texts. Partial vectors were compared to explore dimensionality reduction. Two models with the biggest contrast in semantic difference detection were evaluated on retrieval of relevant information from real Finnish vascular surgery records. Results: Embedding vectors captured semantic differences in clinical text: content deletion and modification produced larger increases in vector distance than paraphrasing. On average, models detected the direction of semantic change correctly, but case-level performance varied considerably. Qwen3-Embedding-8B was the only model with zero directional errors, while multilingual-E5-large erred in 13.8% of cases. In data retrieval, Qwen3-Embedding-8B again outperformed multilingual-E5-large, though the margin was narrower: sufficiency scores were 3.25 vs. 3.17 out of 5 for the first query and 2.25 vs. 1.15 out of 5 for the second query. For some models, as few as 0.6-1.2% of dimensions sufficed to replicate full-vector accuracy; principal component analysis and coordinate-level analysis did not account for this finding. 
Conclusions: Our results show that the choice of embedding model is important: performance differences between models can be large enough to determine whether clinically relevant information reaches the end user, and model weaknesses can be both task-specific and context-dependent.
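The three vector comparisons named in the abstract above (cosine similarity, L1 distance, Euclidean distance) can be sketched in plain Python. The three-dimensional vectors below are hypothetical stand-ins for real embeddings, chosen only to show that a small perturbation should move a vector less than a large one.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def l1_distance(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean_distance(u, v):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy "embeddings"; real models produce hundreds or thousands of dimensions.
original = [1.0, 0.0, 1.0]
paraphrased = [0.9, 0.1, 1.0]  # meaning preserved, so it should stay close
deleted = [0.2, 0.9, 0.1]      # content removed, so it should move further away

for name, vec in [("paraphrased", paraphrased), ("deleted", deleted)]:
    print(name,
          round(cosine_similarity(original, vec), 3),
          round(l1_distance(original, vec), 3),
          round(euclidean_distance(original, vec), 3))
```

A directional check of the kind the abstract reports would then compare these distances pairwise, for example by testing whether the deleted text lies further from the original than the paraphrased text.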
Alsammani, A.; Johnson, M.; Elrefaei, J.
Objective: To develop, calibrate, and interpret machine learning models for predicting in-hospital mortality among intensive care unit (ICU) patients using clinical data collected during the first 24 hours of admission. Methods: We analyzed 53,866 adult ICU admissions from the MIMIC-IV (v2.2) database, including 5,787 in-hospital deaths (10.7%). An enhanced feature-engineering pipeline generated 88 laboratory-based features that captured distributional characteristics, temporal trends, and measurement frequency. Five machine learning classifiers were evaluated: L2-regularized logistic regression, random forest, XGBoost, LightGBM, and a calibrated soft-voting ensemble. Models were developed using a stratified 64:8:8:20 split for training, validation and hyperparameter tuning, calibration, and testing. Performance was assessed on a held-out test set (n = 10,774) using the area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), Brier score, calibration analysis, decision curve analysis (DCA), and SHAP-based model interpretation. Results: The calibrated ensemble achieved the best overall performance, with an AUROC of 0.856 (95% CI: 0.846-0.867), an AUPRC of 0.449 (95% CI: 0.418-0.480), and a Brier score of 0.078. XGBoost (AUROC 0.856; AUPRC 0.435) and LightGBM (AUROC 0.854; AUPRC 0.436) demonstrated performance comparable to the ensemble and significantly outperformed logistic regression (AUROC 0.823; AUPRC 0.376), yielding absolute AUROC improvements of approximately 0.031-0.033 (p < 0.001). Calibration substantially improved probabilistic predictions, reducing Brier scores by 42% for XGBoost (0.134 to 0.078) and 50% for LightGBM (0.151 to 0.076). Decision curve analysis demonstrated consistent net clinical benefit across the 5%-20% risk-threshold range. Key predictors included age, blood urea nitrogen, ICU subtype, measurement frequency, and lactate-related features. 
Model performance remained robust across ICU subtypes, with AUROC values exceeding 0.79. Conclusion: A calibrated and interpretable machine learning framework based on early ICU clinical data provides accurate and clinically actionable mortality risk estimates. By integrating trajectory-aware feature engineering, probabilistic calibration, and decision-analytic evaluation, this approach advances ICU mortality prediction toward more reliable and trustworthy clinical decision support systems.
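For readers unfamiliar with the Brier score used in the abstract above, here is a minimal sketch: it is the mean squared difference between predicted probabilities and binary outcomes, so lower is better. The four-patient example is hypothetical; the two relative reductions are recomputed from the before and after Brier scores quoted in the abstract.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    squared_errors = [(p - y) ** 2 for p, y in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


# Hypothetical predictions for four patients (1 = died in hospital).
toy_score = brier_score([0.1, 0.2, 0.8, 0.1], [0, 0, 1, 0])
print(round(toy_score, 3))

# Brier scores quoted in the abstract, before and after calibration.
xgboost_reduction = relative_reduction(0.134, 0.078)
lightgbm_reduction = relative_reduction(0.151, 0.076)
print(round(xgboost_reduction * 100), round(lightgbm_reduction * 100))
```

The recomputed reductions round to the 42% and 50% figures stated in the abstract.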
Choi, J.; Kim, Y. J.; Lyu, P.; Luan, Y. L.; Toh, S. M.
Artificial intelligence (AI) is increasingly incorporated into diagnostic decision-making, raising questions about physician responsibility following AI-involved adverse diagnostic events. Explainable AI (XAI) has been proposed to improve transparency and trust, but its influence on public reactions remains unclear. In a randomised vignette-based experiment, 652 adults from the United States and United Kingdom were assigned to one of six conditions in a 3 (diagnostic source: AI alone, human radiologist alone, or human-AI collaboration) x 2 (explanation: present or absent) between-subjects design. Participants read a scenario in which a chest X-ray was initially interpreted as normal but lung cancer was diagnosed five months later, indicating that the original interpretation had missed the cancer. In explanation conditions, participants received additional information about how the diagnosis had been reached, including AI heatmap-based explanations in the AI conditions. Participants rated radiologist responsibility, likelihood of complaint, and intention to pursue legal action. Among 652 participants (mean age 42.2 years; 50.2% female), responsibility ratings were significantly lower when AI alone made the diagnostic decision (mean 4.73, 95% CI 4.53-4.93) compared with human-only decision-making (5.78, 95% CI 5.59-5.98; p<0.001) and human-AI collaboration (5.54, 95% CI 5.34-5.74; p<0.001). Complaint likelihood showed a similar pattern. Intentions to pursue legal action followed the same directional trend but were marginally significant. Neither explanations nor explanation-by-source interactions were associated with outcome measures. These findings suggest that the public expects physicians to remain accountable when AI is involved in diagnostic decision-making, particularly in collaborative settings. 
Providing explanatory information about how AI systems reach decisions may be insufficient to change perceptions of physician responsibility following adverse diagnostic events.
Overmars, L. M.; Allaart, C.; Bron, E. E.; Brunner La Rocca, H.-P.; de Bresser, J.; Muller, M.; van Osch, M. J. P.; Teunissen, C.; Tijms, B. M.; Wolters, F. J.; Biessels, G. J.; Heart-Brain Connection Consortium
Background: Vascular cognitive impairment (VCI) and small vessel disease (SVD) involve many interconnected factors influencing multiple outcomes, also beyond cognitive decline. Bayesian networks (BNs) can help unravel these complex interrelations, which we demonstrate in this proof-of-concept study in the Heart-Brain Connection cohort, including memory-clinic patients with SVD, patients with heart failure, carotid occlusive disease, and reference participants. Methods: We trained BNs and jointly modelled cognitive decline (Clinical Dementia Rating (CDR) increase) and major adverse cardiovascular events (MACE) over five years as outcomes in relation to multiple demographic and disease factors and emerging imaging and plasma biomarkers, also considering possible non-random dropout. Results: Of 566 individuals (median age 68, 64% men), 134 had MACE and 112 experienced CDR increase. Diagnostic group and baseline cognition were key determinants of both outcomes. The BN identified baseline clinical severity as a non-random dropout source. Plasma biomarkers formed an interconnected subnetwork, linked to demographic and vascular factors, but without direct dependencies with outcomes. The trained BN also provides individualized inference under partial evidence, informing on outcome probabilities. Conclusion: This proof-of-concept study demonstrates how BNs quantify and visualize the dependency structure underlying prognostic heterogeneity in VCI and SVD, including non-random dropout and positioning of emerging biomarkers.
Jean, A.; Merceron, A.; Le Saux, A.; Mercier, E.; Benillouche, P.
This study aims to assess women's perceptions of artificial intelligence (AI) used in breast cancer screening in France by examining their knowledge of AI and the barriers to their participation in organized screening. The results of a survey conducted in June 2025 among a national sample of 2000 women (aged 40-75) reveal limited participation and persistent concerns among women. Nevertheless, despite a low awareness of specific AI applications, a large majority of the women surveyed are very favorable to the use of AI in breast cancer diagnosis, even considering it a lever to increase screening participation.
Xiao, J.; Zhao, Z.; King, Z. D.; Khalid, M.; Davies, S.; Zanna, K.; Argueta, D. L.; Brice, K. N.; Wu-Chung, E. L.; Lai, V. D.; Paoletti-Hatcher, J.; Denny, B. T.; Henry, S.; Schulz, P. E.; Fagundes, C. P.; Sano, A.
Spousal caregivers of individuals with Alzheimer's disease and related dementias frequently experience elevated perceived stress, caregiver burden, and loneliness, which are associated with adverse health outcomes. Early identification is therefore critical for timely intervention. Existing approaches commonly rely on wearable sensor data and standardized psychological questionnaires, while recent multimodal methods aim to improve prediction by integrating behavioral and linguistic information. In this study, we explored three modality configurations (wearable-derived features, interview-based text, and their combination) to classify caregiver psychological risk using the Perceived Stress Scale (PSS), Zarit Burden Interview, and UCLA Loneliness Scale. We compared traditional machine learning models and large language models (LLMs) (Gemini 2.0, Llama 4, and GPT-4o) under psychometrician-centered and caregiver-centered prompting strategies. Traditional machine learning models performed better under multimodal settings, while LLMs achieved stronger performance with Interview-Only input. We further demonstrate that PSS was the most predictable construct and that prompting strategies substantially influenced LLM performance.
Kelly, R. E.
Null Hypothesis Significance Testing (NHST) remains the dominant paradigm for evaluation of empirical research findings in medicine and the social sciences despite concerns about frequent misinterpretations of those findings. Achievement of "statistical significance," the goal of NHST, often beckons unrealistic conclusions. Helpful would be the addition of a broader, Bayesian perspective of research in terms of progressive readjustment of hypothesis credibility from all sources of evidence. For this purpose, the Hypothesis Race Model (HRM) provides an intuitive Bayesian approach that builds upon NHST-concepts, helping to correct misunderstandings with minimal reeducation. The HRM is an extension of the Bayesian approach by Ioannidis in 2005 that helped to explain "why most published research findings are false." It is powerful enough to serve as the foundation for mathematical models to estimate and reduce the cost of empirical hypothesis testing.
Periwal, V.
Background: Conventional psychiatric screening instruments summarize symptoms within individual scales and prioritize cases with high single-instrument additive score severity. This design treats items as independent within instruments and ignores cross-instrument covariance structure, making it insensitive to respondents whose responses are distributed across multiple domains in unusual combinations that remain below threshold on every individual scale. Methods: We analyzed two cohorts spanning older and younger adults. Item prompts from depression, stress, anxiety, and sleep instruments were embedded into a shared semantic space using a pretrained sentence encoder. Principal component analysis of the item-prompt embeddings alone, with no use of respondent data at this stage, was used to construct a low-dimensional subspace retaining 80% of variance in the item embedding matrix. Normalized participant responses were then projected into this subspace, with Jaccard-based stability analysis used as a check on dimensional robustness. Multivariate deviation from the cohort norm was quantified with Mahalanobis distance using Ledoit-Wolf covariance regularization. Candidate outliers were defined by the empirical 95th percentile of the cohort-specific distance distribution. To isolate response configurations not already captured by conventional single-instrument extreme-value logic, we excluded all outlier respondents who had endorsed any individual item at the maximum value of its Likert scale on any instrument. For the remaining outliers, anomalous components were backtracked to their original item loadings for interpretation. Results: In the older-adult Health and Retirement Study (HRS) cohort, principal component analysis of 27 item-prompt embeddings showed that a 10-dimensional subspace provided a stable representation of cross-instrument semantic structure. In the younger-adult Xinxiang cohort the corresponding stable solution was 16-dimensional.
In each cohort, seven respondents remained as multivariate outliers despite falling below every single-instrument extreme-value threshold. These cases were not characterized by uniformly severe symptom scores but by unusual cross-domain response configurations that became visible only in the shared semantic covariance subspace. The response structure of the retained configurations differed across cohorts: older-adult cases more often involved weak endorsement of mood-labeled items alongside nonzero body- and sleep-related responses, whereas younger-adult cases more often involved incomplete response configurations spanning mood, sleep, stress, and self-harm-related items. Conclusions: A semantically aligned, auditable covariance subspace provides a practical tool for flagging unusual multivariate response configurations that single-instrument additive screening may not flag. The method is interpretable at the level of original item contributions. It should be understood as a hypothesis-generating screen for unusual response configurations requiring further clinical assessment, not as a diagnostic instrument. Outcome validity remains to be established by prospective study.
Sozol, S. S.; Dev Nath, B. C.; Fahim, F. M. S.; Suzana, N. N.; Mirza, J. F.; Ahmmed, S.; Zohra, F.-T.; Zafr, A. H. A.; Uddin, M. N.; Mondal, M. R. H.; Hoque, A. S. M. L.
Machine learning (ML) is being considered to help diagnose cardiovascular diseases (CVD). Still, challenges like inconsistent and limited datasets, limited infrastructure, and global inequalities lead to the need for a reliable and practicable ML solution. This paper presents an ML-driven framework for predicting CVD risk scores and classifying status. Several data preprocessing techniques, including multiple imputation by chained equations (MICE) and outlier removal, are considered. In addition, hyperparameter tuning is performed with the GridSearchCV tuning technique. Moreover, a consensus-driven five-feature selection method is applied to identify optimal predictors. The dataset used in this study contains healthcare records related to future CVD risk scores, comprising 1,529 patient records with 22 features. The optimized stacked ensemble model is applied to the dataset and achieves a cross-validated coefficient of determination value of 98.13% for CVD risk score regression. Comparative evaluation with other ML models confirmed improved accuracy, efficiency, and interpretability. The explainable AI technique SHAP is applied to interpret predictions and highlight key risk factors. Moreover, a deployment-ready web platform with multi-role access has been developed that demonstrates clinical applicability. The proposed framework offers a reliable and interpretable tool for early detection of CVD and personalized risk assessment. In the future, this work can be extended to integrate longitudinal data, medical imaging, and deep learning to improve generalizability and strengthen real-world impact.
Alickovic, F.; Lenz, S.; Ustjanzew, A.; Ortiz Rosario, L.; Vollmar, G. M.; Kindler, T.; Panholzer, T.
Introduction: Coding tumor diagnoses from free-text clinical documentation currently requires substantial manual effort. Promising approaches for automating this process include large language models (LLMs), embedding models, and retrieval-augmented generation (RAG). While previous studies often focus on a single method, we directly compare these approaches on a real-world dataset of tumor diagnosis descriptions to assess their strengths and limitations. Methods: We evaluated nine different embedding models using similarity search and embedding-based classification, as well as LLM-based coding, with and without RAG, on a real-world dataset of 2,024 unique German tumor diagnosis descriptions labeled with ICD-10 and ICD-O topography codes. The retrieval knowledge base was constructed exclusively from standardized Alpha-ID, ICD-10-GM, and ICD-O-3 classifications. Performance was assessed for exact (full-code) and partial (three-character) code prediction. For RAG, we evaluated base and fine-tuned versions of Llama 3.1 8B and Llama 3.3 70B. Results: Qwen3-Embedding-8B, the largest embedding model, yielded the best results. It achieved 47.8% exact-match and 72.1% partial-match accuracy for ICD-10 coding with classification, and 42.7% exact-match and 73.5% partial-match accuracy for ICD-O coding with similarity search. The other embedding models, including medically specialized ones, showed varied but lower performance. RAG improved base LLM performance and outperformed embedding-based approaches on partial-match accuracy (80.6% partial-match accuracy for ICD-10 and 75.0% for ICD-O with Llama 3.3 70B), but not on exact-match accuracy. Conclusion: A direct comparison with embedding-based approaches is essential to determine whether the additional effort of RAG is justified. The strong variation in performance also highlights the importance of model selection.
Further advances in embedding-based methods, potentially supported by larger and more diverse training data, may offer a promising direction for future work.
Islam, N.; Luo, C.; Tong, J.; Weller, G.; Polleya, D. A.; Kent, A.; Bair, S.
Introduction: In analyses of time-to-event data, clinical characteristics can have non-linear impacts on survival outcomes, and understanding this dynamic behavior is crucial for producing real-world evidence (RWE). Nonetheless, estimating these dynamic effects is inherently challenging when utilizing real-world data (RWD), especially since sharing individual-level patient data (IPD) is heavily restricted due to regulatory limitations. Additionally, computational difficulties are exacerbated by the high dimensionality, inter-dependency, rarity, sparsity, and scarcity of features. While data augmentation through collaboration across multiple sites might address these challenges, such collaboration is often infeasible and hindered by regulatory measures that protect patient privacy, thereby preventing the sharing of IPD between sites. Objectives: To address this challenge, we propose a privacy-preserving regularized algorithm that eliminates the necessity of aggregating any protected health information across sites. This algorithm employs a penalized federated additive model utilizing piecewise exponential survival (FAMES) data and estimates non-linear effects of features while accounting for non-varying confounding effects. The model is flexible and can accommodate both multiple and multivariate smooth effects simultaneously. Methods: The proposed model transforms survival data into a piecewise exponential data (PED) structure and casts the semi-parametric optimization problem into a generalized additive modeling framework assuming Poisson distribution. The model uses orthonormal splines to approximate non-linear effects and incorporates L2-norm based penalty terms to control the smoothness and goodness-of-fit of these effects. The algorithm is optimized using site-specific aggregated summary statistics and is solved iteratively through the Newton-Raphson method.
Results: The model is employed to assess the smooth effects of clinical features, such as age and numeric laboratory values, on overall survival using RWD from approximately 874 newly diagnosed Acute Myeloid Leukemia (AML) patients treated at seven distinct sites in the United States. The model exhibited non-linear smooth effects for lactate dehydrogenase, platelets, and others, underscoring their strong association with disease prognosis. The model demonstrates a lossless property, providing estimates of smooth and fixed effects that are comparable to those derived from the pooled PED. Additionally, the inference of parameters for testing the nullity of effects remains consistent. This model is communication-efficient, necessitating roughly twelve rounds of communication across sites. Conclusion: We anticipate that this model can facilitate multisite collaboration and enable smaller sites to participate in generating and validating RWE, especially for rare diseases. While the model was applied within the context of AML, it is disease-agnostic and can be implemented in any other clinical context and across various sites globally without losing any generality.
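The transformation into a piecewise exponential data (PED) structure mentioned in the abstract above can be sketched as follows. This is the generic textbook construction, not the authors' implementation: each subject is split into one row per time interval at risk, with the time spent in the interval as exposure and an event flag in the interval where the event occurred. The cut points and the function name `to_ped` are hypothetical.

```python
def to_ped(time, event, cuts):
    """Split one subject's follow-up into per-interval rows.

    time  : observed follow-up time
    event : 1 if the event occurred at `time`, 0 if censored
    cuts  : increasing interval end points, e.g. [1.0, 2.0, 3.0]
            (follow-up beyond the last cut is dropped in this sketch)
    """
    rows = []
    start = 0.0
    for end in cuts:
        if time <= start:
            break  # subject is no longer at risk in this interval
        exposure = min(time, end) - start
        event_here = int(event == 1 and time <= end)
        rows.append({"start": start, "end": end,
                     "exposure": exposure, "event": event_here})
        start = end
    return rows


# One subject with an event at t = 2.5 and one censored at t = 1.5.
event_rows = to_ped(2.5, 1, [1.0, 2.0, 3.0])
censored_rows = to_ped(1.5, 0, [1.0, 2.0, 3.0])
for row in event_rows + censored_rows:
    print(row)
```

In a Poisson formulation the event flag would be the response and the logarithm of the exposure would enter as an offset, which is what lets a survival model be fitted with generalized additive modeling tools.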
Eskandarian, M.; Malekpour, S. A.
Purpose: In clinical practice, accurate prediction of disease risk must be accompanied by transparent, human-understandable explanations to support diagnostic confidence, guide therapeutic decisions, and meet ethical and regulatory standards. While deep neural networks achieve high predictive performance in tasks such as cancer detection and diabetes risk stratification, their black-box nature prevents clinicians from understanding the reasoning behind predictions, severely limiting trust and safe integration into patient care. Methods: We present Regression-Based Boolean Rule (RBBR), a framework that automatically derives clinically interpretable Boolean rules directly from patient data. RBBR generates human-readable conjunctions (logical AND combinations) of up to three clinical features, transforms them into inputs for ridge regression to predict binary or multi-class disease outcomes, estimates rule importance via regularized coefficients, and selects the most parsimonious and predictive rule sets using the Bayesian Information Criterion. Results: Applied to six real-world medical datasets (lung cancer screening and staging, Wisconsin and diagnostic breast cancer, heart failure, and early-stage diabetes risk), RBBR consistently produced concise, clinically meaningful rules (e.g., gender-specific symptom combinations in diabetes, distinct histopathological subpopulations in breast cancer, and symptom-risk factor interactions in lung cancer) with strong explanatory power (R² up to 0.92) and competitive discrimination. Conclusion: By delivering logical, transparent decision rules aligned with clinical reasoning (if symptom A and B, then high risk), RBBR bridges the gap between predictive accuracy and bedside usability, enabling clinicians to validate predictions, identify high-risk patients, stratify subpopulations, and enhance shared decision-making in routine care.
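The rule-generation step described in the abstract above (logical AND combinations of up to three features, later fed to a regression) can be illustrated with a short sketch. This is not the RBBR implementation: negated features, the ridge regression fit, and the BIC-based selection are all omitted, and the feature names and patient record are hypothetical.

```python
from itertools import combinations


def conjunction_rules(feature_names, max_size=3):
    """All AND-combinations of 1 to `max_size` distinct features."""
    rules = []
    for size in range(1, max_size + 1):
        rules.extend(combinations(feature_names, size))
    return rules


def rule_value(rule, patient):
    """1 if every feature in the rule is present for this patient, else 0."""
    return int(all(patient[name] for name in rule))


# Hypothetical binary features; not taken from the datasets in the abstract.
features = ["polyuria", "polydipsia", "weight_loss", "male"]
rules = conjunction_rules(features)

patient = {"polyuria": 1, "polydipsia": 1, "weight_loss": 0, "male": 1}
design_row = [rule_value(rule, patient) for rule in rules]

print(len(rules))
print([" AND ".join(rule) for rule, v in zip(rules, design_row) if v == 1])
```

Each patient becomes one row of 0/1 rule activations, and a regularized linear model over those columns would assign a coefficient per rule, which is how a rule's importance could be read off.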
Panchumarthi, L. Y.; Kataria, S.; Wu, Y.; Hu, X.; Fedorov, A.; Kwak, H. G.
Background. Fairness-aware machine learning increasingly targets demographic performance disparities in clinical prediction, yet whether standard bias mitigation strategies genuinely improve equity in physiological signal analysis remains unclear. Age-based disparities in photoplethysmography (PPG)-based heart rate prediction present a particular challenge, as age-related performance differences may reflect context-dependent physiological structure rather than correctable artifacts. Methods. We evaluated three fairness interventions, inverse-frequency weighting (IF), Group Distributionally Robust Optimization (GroupDRO), and adversarial debiasing (ADV), applied via fine-tuning of a PPG foundation model across three clinical datasets spanning intensive care unit, laboratory, and consumer wearable contexts. Outcomes were assessed using a 2x2 framework classifying each intervention-dataset combination by the joint direction of change in mean absolute error (MAE) and fairness gap (FG) across age groups, yielding four outcome types: genuine improvement (G), leveling down (L), selective benefit (S), and both worse (W). Results. Across nine intra-domain conditions, no intervention simultaneously improved both MAE and FG (0/9 genuine improvement). The dominant pattern was leveling down (5/9): FG decreased but was accompanied by MAE degradation, indicating that apparent fairness gains were achieved at the cost of overall predictive performance. Age-group difficulty ordering varied across clinical contexts at baseline and was not preserved under intervention. In 18 cross-domain transfer conditions, genuine improvement was rare (4/18) and observed exclusively in non-MIMIC source configurations; models fine-tuned on MIMIC-sourced data yielded no genuine improvements (0/6). Embedding-level representation changes following fine-tuning did not reliably predict fairness outcomes. Conclusions. 
Age-based fairness interventions in PPG heart rate prediction indicate a leveling-down pattern rather than genuine equity improvement, suggesting that age-related performance gaps reflect context-dependent physiological structure not fully addressable through standard bias mitigation. Cross-domain transfer further amplifies this instability. These findings suggest that fairness evaluation frameworks for age-stratified physiological prediction should account for context-dependent performance structure rather than treating observed gaps as correctable bias.
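The 2x2 outcome framework in the abstract above maps the joint direction of change in mean absolute error (MAE) and fairness gap (FG) to one of four labels. The sketch below encodes that mapping. The abstract defines genuine improvement and leveling down explicitly; reading selective benefit as "MAE improves while FG widens" is inferred by elimination, and treating a change of exactly zero as no improvement is an assumption.

```python
def classify_outcome(delta_mae, delta_fg):
    """Label an intervention by the sign of its change relative to baseline.

    delta_mae : MAE after intervention minus MAE at baseline (negative = better)
    delta_fg  : fairness gap after minus fairness gap at baseline (negative = smaller gap)
    """
    mae_better = delta_mae < 0
    fg_better = delta_fg < 0
    if mae_better and fg_better:
        return "G"  # genuine improvement: both improve
    if fg_better:
        return "L"  # leveling down: gap shrinks but overall error rises
    if mae_better:
        return "S"  # selective benefit (inferred): error falls but gap widens
    return "W"      # both worse


# Hypothetical changes for four intervention-dataset combinations.
examples = [(-0.5, -0.2), (0.8, -0.3), (-0.4, 0.1), (0.6, 0.2)]
labels = [classify_outcome(d_mae, d_fg) for d_mae, d_fg in examples]
print(labels)
```

Counting these labels over all intervention-dataset combinations gives tallies of the kind the abstract reports, such as 0/9 genuine improvement and 5/9 leveling down.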
Daniel, L.-I.; Ros-Leon, A.; Molina-Rodriguez, S.; Pellicer-Porcar, O.; Cabrera-Perona, V.; Ibanez-Ballesteros, J.
The proliferation of gambling advertising has intensified concerns regarding its influence on vulnerable populations, yet the neural mechanisms underlying cue-reactivity to these stimuli remain underexplored in ecologically valid settings. This study protocol proposes a novel methodological framework to investigate prefrontal cortical responses to gambling advertisements in individuals with varying degrees of gambling experience. Materials and methods: This cross-sectional study will recruit 44 participants, divided into a clinical group (individuals with high-frequency gambling or gambling disorder) and a matched control group. Neural activity will be recorded using fNIRS while participants view gambling-related, neutral, violent, and sexual stimuli. Secondary measures include validated scales for gambling severity (SOGS), impulsivity, sensation seeking, and alexithymia. Data analysis will primarily utilize inter-subject correlation (ISC) to quantify neural synchronization and multiband frequency decomposition to capture dynamic affective processing. Advanced preprocessing, including short-channel regression, will be applied to ensure signal robustness. Discussion: By combining portable neuroimaging with a data-driven ISC approach, this study aims to identify objective neural markers of gambling vulnerability. The findings will provide novel insights into the idiosyncratic processing of commercial stimuli, potentially informing public health policies and the development of more effective evidence-based regulations for gambling marketing.
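To make the inter-subject correlation (ISC) idea concrete, the sketch below computes a leave-one-out ISC for a single fNIRS channel: each participant's time series is correlated with the average of all other participants. The protocol does not say which ISC variant will be used, so the leave-one-out form, the function name, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def leave_one_out_isc(signals: np.ndarray) -> np.ndarray:
    """Leave-one-out ISC for one channel.

    signals: array of shape (n_subjects, n_timepoints), n_subjects >= 2.
    Returns one Pearson correlation per subject, computed between that
    subject's signal and the mean signal of the remaining subjects.
    """
    n_subjects = signals.shape[0]
    total = signals.sum(axis=0)
    isc = np.empty(n_subjects)
    for i in range(n_subjects):
        others_mean = (total - signals[i]) / (n_subjects - 1)
        isc[i] = np.corrcoef(signals[i], others_mean)[0, 1]
    return isc


# Simulated example: a shared stimulus-driven component plus subject noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 60, 600)
shared = np.sin(2 * np.pi * 0.05 * t)
group = shared + 0.5 * rng.standard_normal((8, t.size))
print(leave_one_out_isc(group).round(2))
```

Higher values indicate that a participant's response tracks the group response more closely, which is the sense in which ISC quantifies neural synchronization.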
Teixeira, A. C. F. d. S. B.; Pereira, O. d. A.; Vasconcelos, J. P.; Alves, J. M. F.; Teixeira, C. E. C.
Introduction: Infectious and wound-healing complications after colorectal surgery often increase the complexity of local care and the need for specialized enterostomal therapy follow-up after hospital discharge. Despite the growing use of predictive models in digestive surgery, a translational gap remains between perioperative prediction and the practical organization of specialized care. Therefore, the aim of this study was to develop and temporally validate a machine-learning-based risk stratification model to estimate the probability of post-discharge outcomes associated with greater demand for enterostomal therapy after colorectal surgery. Methods: This was a retrospective observational study including 7,908 patients who underwent colorectal surgery between 2005 and 2014. The outcome was defined as the occurrence of superficial surgical site infection, delayed wound healing, or abdominal sinus formation. Routinely available preoperative and intraoperative variables were used as predictors. The primary model was based on gradient boosting with isotonic calibration. Temporal validation was performed by separating cohorts according to year of surgery. Performance was assessed using ROC-AUC, PR-AUC, Brier score, calibration, and decision-oriented clinical metrics. Clinical utility was examined through percentile-based risk stratification and Decision Curve Analysis (DCA). Results: The outcome prevalence in the test set was 6.6%. The calibrated model achieved a ROC-AUC of 0.64 and a PR-AUC of 0.11, with a Brier score of 0.061. The Top-10% risk stratum concentrated approximately twice the baseline event rate ({approx}14% vs. 6.6%), with a number needed for intensified follow-up of 7 patients to identify one event. Decision curve analysis showed greater net benefit than strategies of following all or no patients, particularly for threshold probabilities between 3% and 13%. 
Models based exclusively on preoperative or intraoperative variables performed worse than the combined model. Conclusion: STOMAPY demonstrated the ability to organize patients along a continuous gradient of risk for post-discharge outcomes associated with greater demand for enterostomal therapy. Although discriminatory performance was moderate, the adequate calibration, temporal validation, and net benefit observed across clinically plausible thresholds support its usefulness as a tool for proportional care prioritization rather than as an individual diagnostic test. Prospective studies and external validations are needed to confirm direct clinical impact.
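The stratification figures reported in this abstract follow from simple arithmetic, sketched below together with the standard net-benefit formula used in decision curve analysis. The function names and the 1,000-patient cohort are hypothetical. The 14% top-decile event rate is the approximate value quoted in the abstract, so the results are approximate as well.

```python
def enrichment(stratum_rate: float, baseline_rate: float) -> float:
    """How many times the baseline event rate a risk stratum concentrates."""
    return stratum_rate / baseline_rate


def number_needed(stratum_rate: float) -> float:
    """Patients to follow intensively, on average, to capture one event."""
    return 1.0 / stratum_rate


def net_benefit(true_pos: int, false_pos: int, n: int, threshold: float) -> float:
    """Standard decision-curve net benefit at a threshold probability.

    False positives are weighted by the odds of the threshold, which is the
    harm of an unnecessary follow-up relative to the benefit of a needed one.
    """
    weight = threshold / (1.0 - threshold)
    return true_pos / n - weight * false_pos / n


baseline_rate = 0.066    # outcome prevalence reported for the test set
top_decile_rate = 0.14   # approximate Top-10% event rate from the abstract

print(round(enrichment(top_decile_rate, baseline_rate), 2))  # 2.12
print(round(number_needed(top_decile_rate), 1))              # 7.1

# "Follow everyone" policy in a hypothetical cohort of 1,000 patients
# with 66 events, evaluated at a 3% threshold probability.
print(round(net_benefit(66, 934, 1000, 0.03), 3))            # 0.037
```

A model adds clinical value at a given threshold only if its net benefit exceeds both this follow-everyone value and zero, the value of following no one.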
Naderalvojoud, B.; Sutjiadi, B. J.; Koul, A.; Curtin, C.; Gevaert, O.; Hernandez-Boussard, T.
Background Machine learning (ML) models are increasingly used to predict adverse outcomes after surgery. However, most rely on static patient characteristics (e.g., age, comorbidities) and overlook clinician-controlled treatment decisions that can be actively modified at the point of care. Discharge opioid prescribing is a key modifiable, clinician-controlled decision, yet optimizing prescribing choices across multiple adverse outcomes remains underexplored in predictive modeling. This study addresses that gap by introducing a novel ML framework that explicitly separates fixed patient risk factors from modifiable prescribing options to support personalized, risk-informed opioid prescribing decisions. Methods We developed the Hierarchical Clinical Fusion Transformer (HCF-Transformer), an ML model designed to estimate patient-specific risks across four postoperative outcomes: prolonged opioid use (POU), chronic pain (CP), 30-day readmission, and opioid-associated outcomes (OAO). The model constructs patient risk profiles from fixed, non-modifiable baseline factors, followed by a transformer layer. Clinician-controllable discharge opioid regimens are modeled as alternative intervention candidates and fused with the fixed risk representation through a clinical fusion mechanism, enabling assessment and ranking based on predicted risks. A Total Relative Risk (TRR) metric, calibrated to each outcome prediction threshold, guides the recommendation process. We evaluated the model in diabetic surgical patients, a common high-risk population. Results The study included 157,853 unique diabetic surgical patients, with outcome prevalences ranging from 47.2% (POU) to 1.8% (OAO). The HCF-Transformer achieved the highest AUROCs, 0.798 for POU, 0.712 for 30-day readmission, 0.808 for CP, and 0.922 for OAO, outperforming Random Forest, FT-Transformer, and ResNet-based models. 
Compared to these baselines, HCF-Transformer generated more stable and discriminative risk estimates and demonstrated significant variation in TRR scores across discharge opioid options (ANOVA p < .01, eta-squared > .01). This enabled consistent identification of lower-risk regimens tailored to patient-specific profiles. Conclusions The HCF-Transformer introduces a novel hierarchical fusion approach to optimize opioid prescribing by integrating static patient risk profiles with modifiable discharge options. Using transformer-based modeling and a quantifiable TRR metric, the model delivers personalized, risk-aware recommendations. This approach enables data-driven opioid prescribing tailored to individual risk and has the potential to improve postoperative outcomes in high-risk populations. Our findings demonstrate that integrating modifiable factors with structured risk profiles through a transformer-based fusion architecture can enhance decision-support systems, paving the way for more actionable and personalized AI in healthcare.
Kurt, F.; Subasi, A.
Background: Traditional diagnostic models lack explainability, while multimodal language models prone to hallucination remain unsafe for medical education. An interactive, risk-free artificial intelligence framework is required to serve as a reliable clinical mentor for radiology trainees. Methods: We propose a multi-agent architecture decoupling deterministic image analysis from generative consultation. Specialized computer vision models perform anatomical localization and pathological segmentation. These quantitative outputs are synthesized into a structured payload, which grounds a locally hosted large language model (LLaVA 7B) using strict prompt guardrails and prerequisite protocols. Results: The system effectively eliminates visual hallucinations by intercepting unanchored queries. The artificial intelligence tutor successfully contextualizes spatial anomalies and baseline metrics, generating accurate conversational explanations and formally structured radiology reports while strictly enforcing medical safety disclaimers. Discussion and Conclusion: By anchoring language generation exclusively to verified algorithmic realities, this framework transforms opaque diagnostic models into safe, interactive educational simulators. This establishes a highly reliable paradigm for integrating explainable artificial intelligence into medical training.
Biswas, M. A.; Laila, A.
Background: Machine learning models trained on population health surveys offer scalable tools for cardiovascular screening, but recurring methodological weaknesses undermine their credibility and equity: data leakage from synthetic oversampling, qualitative rather than quantitative explainability evaluation, and the absence of demographic fairness auditing at the clinical operating threshold. Methods: We present EXHEART, a leakage-free stacked ensemble pipeline trained on BRFSS 2015 (n = 253,680) and validated on BRFSS 2020 (n = 319,795; temporal transport and retrain) and a clinical cardiovascular examination dataset (n = 68,730). The pipeline combines XGBoost, LightGBM, Random Forest, and a multi-layer perceptron as base learners with 5-fold out-of-fold logistic regression stacking and Platt scaling calibration. A quantitative SHAP-LIME consistency framework, based on Kendall-tau rank correlation and Jaccard overlap, accompanies a decision-curve analysis, a subgroup-stratified SHAP interaction analysis, and an intersectional fairness audit (Sex x Age x Income) with threshold-shifting mitigation and a frontier of the fairness-utility trade-off. The framework also adds cross-instrument fairness-disparity attribution, an empirical diagnostic that provides evidence on whether an observed subgroup disparity is more consistent with a measurement-induced or a substantive explanation by re-validating it on a dataset that measures the same clinical construct objectively. On heart disease, this diagnostic associates 89% of the sex TPR gap (95% CI [0.65, 0.99]) with the self-reported survey outcome rather than with a substantive risk difference. Results: On BRFSS 2015, EXHEART achieves AUC-ROC = 0.850, AUPRC = 0.371, Brier score = 0.071, and reduces ECE by 96% (0.256 to 0.011) via Platt scaling. 
Global SHAP-LIME rank agreement is moderate-to-strong (Kendall-tau = 0.580, Spearman-rho = 0.818) with a substantial top-3 divergence (Jaccard@3 = 0.200), where Stroke flips from SHAP rank 8 to LIME rank 1. The Sex TPR gap is 0.124 at the screening threshold; intersectional Sex x Age disparities reach 0.649 among adequately-powered cells, 5.2x the single-attribute gap. Temporal transport to BRFSS 2020 collapses sensitivity from 0.776 to 0.267, while retraining restores AUC = 0.840 and ECE = 0.012. On clinical examination data, the Sex TPR gap collapses to 0.014; the attribution test indicates this gap is instrument-dependent, consistent with a measurement or outcome-definition explanation rather than a substantive risk difference. Cross-domain SHAP analysis identifies four instrument-independent CVD risk factors and two major portability failures. Conclusions: EXHEART combines three practices that population-scale cardiovascular classifiers usually apply in isolation: leakage-free training with calibrated probabilities, a test of whether the model's explanations are stable, and a fairness audit that examines intersecting subgroups rather than single attributes. Bringing them together proved worthwhile. The intersectional audit revealed disparities that single-attribute auditing missed, and the cross-instrument comparison indicated that much of the sex gap reflects how the outcome is measured in survey data rather than a substantive difference in risk. The temporal transport findings indicate that deployed BRFSS models warrant periodic monitoring and retraining to maintain clinical utility. EXHEART is a retrospective methodological evaluation on public de-identified data; it is not validated for direct clinical decision-making, diagnosis, or treatment recommendation without prospective clinical validation.
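The two agreement measures behind the SHAP-LIME consistency framework, Kendall-tau rank correlation and Jaccard overlap of the top-k features, can be sketched in plain Python. This is an illustration under stated assumptions rather than the EXHEART implementation: the function names are hypothetical, the tau variant is tau-a for rankings without ties, and the feature names in the example are placeholders.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(rank_a: Sequence[float], rank_b: Sequence[float]) -> float:
    """Kendall tau-a between two rankings of the same features (no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


def jaccard_at_k(order_a: Sequence[str], order_b: Sequence[str], k: int) -> float:
    """Jaccard overlap of the top-k features from two importance orderings."""
    top_a, top_b = set(order_a[:k]), set(order_b[:k])
    return len(top_a & top_b) / len(top_a | top_b)


# Placeholder feature orderings, most important first (hypothetical names).
shap_order = ["f1", "f2", "f3", "f4", "f5"]
lime_order = ["f5", "f1", "f4", "f2", "f3"]

# Convert orderings to per-feature ranks so both lists index the same features.
features = sorted(shap_order)
shap_rank = [shap_order.index(f) for f in features]
lime_rank = [lime_order.index(f) for f in features]

print(kendall_tau(shap_rank, lime_rank))
print(jaccard_at_k(shap_order, lime_order, 3))  # 0.2
```

A Jaccard@3 of 0.2 means the two top-3 lists share exactly one feature (one shared out of five distinct), which is how a moderate-to-strong global rank correlation can coexist with a substantial disagreement at the top of the ranking.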